User Inputs

output.var = params$output.var 
transform.abs = params$transform.abs
log.pred = params$log.pred
eda = params$eda
algo.forward = params$algo.forward
algo.backward = params$algo.backward
algo.stepwise = params$algo.stepwise
algo.LASSO = params$algo.LASSO
algo.LARS = params$algo.LARS

message("Parameters used for training/prediction: ")
## Parameters used for training/prediction:
str(params)
## List of 9
##  $ output.var   : chr "y3"
##  $ transform.abs: logi FALSE
##  $ log.pred     : logi FALSE
##  $ eda          : logi FALSE
##  $ algo.forward : logi FALSE
##  $ algo.backward: logi FALSE
##  $ algo.stepwise: logi FALSE
##  $ algo.LASSO   : logi TRUE
##  $ algo.LARS    : logi FALSE
# Setup Labels
# alt.scale.label.name = Alternate Scale variable name
#   - if predicting on log, then alt.scale is normal scale
#   - if predicting on normal scale, then alt.scale is log scale
if (log.pred) {
  label.names = paste0('log.', output.var)
  alt.scale.label.name = output.var
} else {
  label.names = output.var
  alt.scale.label.name = paste0('log.', output.var)
}

Loading Data

feat  = read.csv('../../Data/features.csv')
labels = read.csv('../../Data/labels.csv')
predictors = names(dplyr::select(feat,-JobName))
target = 'y3'
data.ori = inner_join(feat,select_at(labels,c('JobName',target)),by='JobName')

Data validation

cc  = complete.cases(data.ori)
data.notComplete = data.ori[! cc,]
data = data.ori[cc,]
message('Non-Complete cases: ',nrow(data.notComplete))
## Non-Complete cases: 2497
message('Complete cases: ',nrow(data))
## Complete cases: 7503

Normality and Variance

Target Variable

The target variable y3 is right-skewed, so we suggest a log transformation (see the Feature Eng section).
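As a rough illustration of why a log transformation helps with right skew, the sketch below compares sample skewness before and after taking logs. It uses simulated log-normal data, not y3, and the `skewness` helper is a hypothetical moment-based estimator written for this example.

```r
# Hypothetical helper: moment-based sample skewness (not from any package).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulated right-skewed data standing in for a target such as y3 (assumption).
set.seed(7)
y <- rlnorm(5000, meanlog = 0, sdlog = 0.8)

skew_raw <- skewness(y)       # clearly positive for right-skewed data
skew_log <- skewness(log(y))  # close to zero after the log transformation
c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transformation is consistent with a roughly symmetric distribution, which is what the histogram and QQ plot are meant to check visually.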

Histogram

ggplot(gather(select_at(data,target)), aes(value)) + 
  geom_histogram(aes(y=..density..),bins = 50,fill='light blue') + 
  geom_density() + 
  facet_wrap(~key, scales = 'free',ncol=4)

QQPlot

ggplot(gather(select_at(data,target)), aes(sample=value)) + 
  stat_qq() + 
  facet_wrap(~key, scales = 'free',ncol=4)

Predictors

All predictors show fat tails: the histograms are tall at both tails and low around the mean. The orderNorm transformation can help (see the [Best Normalizator] section).

Interesting Predictors.

cols = c('x11','x18')
ggplot(gather(select_at(data,cols)), aes(value)) + 
  geom_histogram(aes(y=..density..),bins = 50,fill='light blue') + 
  geom_density() + 
  facet_wrap(~key, scales = 'free',ncol=4)

ggplot(gather(select_at(data,cols)), aes(sample=value)) + 
  stat_qq()+
  facet_wrap(~key, scales = 'free',ncol=4)

lapply(select_at(data,cols),summary)
## $x11
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 9.000e-08 9.500e-08 1.000e-07 1.001e-07 1.050e-07 1.100e-07 
## 
## $x18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   3.128   4.769   4.769   6.415   7.999

Best Normalizator X18

Normalization of x18 using the bestNormalize package, which suggests orderNorm. This is a neat result, but it probably goes beyond the objective of this project.

t=bestNormalize::bestNormalize(data$x18)
## Warning in orderNorm(standardize = TRUE, warn = TRUE, x = c(4.76747513, : Ties in data, Normal distribution not guaranteed
t
## Best Normalizing transformation with 7503 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - No transform: 8.2332 
##  - Box-Cox: 8.0085 
##  - Log_b(x+a): 10.24 
##  - sqrt(x+a): 8.0772 
##  - exp(x): 124.9108 
##  - arcsinh(x): 9.7688 
##  - Yeo-Johnson: 8.326 
##  - orderNorm: 1.1011 
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 7503 nonmissing obs and ties
##  - 7500 unique values 
##  - Original quantiles:
##    0%   25%   50%   75%  100% 
## 1.500 3.128 4.769 6.415 7.999
newx18 = predict(t)
qqnorm(data$x18)

qqnorm(newx18)

orderNorm() is a rank-based procedure by which the values of a vector are mapped to their percentile, which is then mapped to the same percentile of the normal distribution. Without ties, this essentially guarantees that the transformed values follow a normal distribution; with ties, as the warning above notes, normality is not guaranteed.
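The rank-to-percentile-to-normal-quantile mapping described above can be sketched in a few lines of base R. This is only an illustration on simulated uniform data: `order_norm_sketch` is a hypothetical helper, and the actual bestNormalize implementation may differ in details such as tie handling and how new values are interpolated.

```r
# Hypothetical helper: rank-based normal transformation (illustration only).
order_norm_sketch <- function(x) {
  n <- length(x)
  # Map each value to a percentile strictly inside (0, 1) via its rank.
  p <- (rank(x, ties.method = "average") - 0.5) / n
  # Map each percentile to the same percentile of the standard normal.
  qnorm(p)
}

# Simulated data on roughly the same range as x18 (assumption, not the real data).
set.seed(1)
x <- runif(1000, min = 1.5, max = 8)
z <- order_norm_sketch(x)

# The ordering of the values is preserved; only their spacing changes.
qqnorm(z)
```

Because the mapping depends only on ranks, it is monotone, so it reshapes the distribution without changing the order of the observations.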


Best Normalizator X11

Normalization of x11 using the bestNormalize package, which suggests orderNorm. This is a neat result, but it probably goes beyond the objective of this project.

t=bestNormalize::bestNormalize(data$x11)
## Warning in orderNorm(standardize = TRUE, warn = TRUE, x = c(1.05e-07, 1.03e-07, : Ties in data, Normal distribution not guaranteed
t
## Best Normalizing transformation with 7503 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - No transform: 13.8579 
##  - Box-Cox: 13.7962 
##  - Log_b(x+a): 13.8579 
##  - sqrt(x+a): 13.8579 
##  - exp(x): 13.8579 
##  - arcsinh(x): 13.8579 
##  - Yeo-Johnson: 13.8579 
##  - orderNorm: 7.1736 
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 7503 nonmissing obs and ties
##  - 111 unique values 
##  - Original quantiles:
##   0%  25%  50%  75% 100% 
##    0    0    0    0    0
qqnorm(data$x11)

qqnorm( predict(t))

orderNorm() is a rank-based procedure by which the values of a vector are mapped to their percentile, which is then mapped to the same percentile of the normal distribution. Without ties, this essentially guarantees that the transformed values follow a normal distribution; with ties, as the warning above notes, normality is not guaranteed.

Histograms

All predictors show strong signs of fat tails.

ggplot(gather(select_at(data,predictors)), aes(value)) + 
  geom_histogram(aes(y=..density..),bins = 50,fill='light blue') + 
  geom_density() + 
  facet_wrap(~key, scales = 'free',ncol=4)

QQPlots

ggplot(gather(select_at(data,predictors)), aes(sample=value)) + 
  stat_qq() + 
  facet_wrap(~key, scales = 'free',ncol=4)

Correlations

With Target Variable

#chart.Correlation(select(data,-JobName),  pch=21)
t=round(cor(dplyr::select(data,-one_of(target,'JobName')),select_at(data,target)),4)
DT::datatable(t)

All Variables

#chart.Correlation(select(data,-JobName),  pch=21)
t=round(cor(dplyr::select(data,-one_of('JobName'))),4)
DT::datatable(t,options=list(scrollX=T))

Scatter Plots with Target Variable

Scatter plots with all predictors and the target variable (y3)

d = gather(dplyr::select_at(data,c(predictors,target)),key=target,value=value,-y3)
ggplot(data=d, aes(x=value,y=y3)) + 
  geom_point(color='light blue',alpha=0.5) + 
  geom_smooth() + 
  facet_wrap(~target, scales = 'free',ncol=4)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Multicollinearity - VIF

No multicollinearity among predictors: the largest VIF is about 1.06, far below the commonly used thresholds of 5 or 10.

Top 10 predictors by VIF value:

vifDF = usdm::vif(select_at(data,predictors)) %>% arrange(desc(VIF))
head(vifDF,10)
##    Variables      VIF
## 1     stat14 1.061347
## 2    stat142 1.061307
## 3    stat154 1.060047
## 4    stat178 1.059169
## 5        x20 1.059153
## 6     stat80 1.059151
## 7     stat86 1.059038
## 8     stat20 1.058919
## 9     stat12 1.058504
## 10    stat35 1.058189
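For context, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others, so values near 1 mean the predictor is nearly uncorrelated with the rest. The sketch below computes this by hand on simulated data; `vif_sketch` and `toy` are hypothetical names for this example, and usdm::vif may differ in implementation details.

```r
# Hypothetical helper: VIF computed directly from its definition.
vif_sketch <- function(df) {
  sapply(names(df), function(v) {
    # Regress predictor v on all remaining predictors.
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated data (assumption): a and c are nearly collinear, b is independent.
set.seed(42)
toy <- data.frame(a = rnorm(200), b = rnorm(200))
toy$c <- toy$a + rnorm(200, sd = 0.1)

vifs <- vif_sketch(toy)
round(vifs, 2)
```

In this toy example a and c get large VIFs while b stays close to 1, which is the pattern all predictors show in the table above.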

Feature Eng

  • No transformation for x18 (a square-root version is created below only for comparison and then dropped)

  • Log transformation for y3

df = data %>%
  mutate(x18sqrt = sqrt(x18),
         y3log = log(y3))
target = 'y3log'
cols = c('y3', 'y3log', 'x18', 'x18sqrt')

Density Plots

Before and after transformation

ggplot(gather(select_at(df,cols)), aes(value)) + 
  geom_histogram(aes(y=..density..),bins = 50,fill='light blue') + 
  geom_density() + 
  facet_wrap(~key, scales = 'free',ncol=4)

Scatter Plots

Vs y3log

cols2=cols[!cols %in% c('y3')]
d = gather(dplyr::select_at(df,cols2),key=target,value=value,-y3log)
ggplot(data=d, aes(x=value,y=y3log)) + 
  geom_point(color='light blue',alpha=0.5) + 
  geom_smooth() + 
  facet_wrap(~target, scales = 'free',ncol=4)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#removing unwanted variables
df=df %>%
  dplyr::select(-x18sqrt,-y3)

Scatter Plots with transformed Target Variable

Scatter plots with all predictors and the transformed target variable (y3log)

d = gather(dplyr::select_at(df,c(predictors,target)),key=target,value=value,-y3log)
ggplot(data=d, aes(x=value,y=y3log)) + 
  geom_point(color='light green',alpha=0.5) + 
  geom_smooth() + 
  facet_wrap(~target, scales = 'free',ncol=4)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Conclusion

  • the target variable y3 can be log transformed

  • the predictor x18 does not improve with a square-root transformation

  • all predictors could benefit from an orderNorm transformation